Alibaba Group
Abstract:Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains unclear. Existing embodied and game-based benchmarks often compress interaction into short-horizon tasks or entangle success with domain-specific game mechanics. In this paper, we introduce MineExplorer benchmark for evaluating open-world exploration capabilities of MLLM agents in Minecraft. We first filter atomic tasks whose solutions rely heavily on Minecraft-specific knowledge to better reflect general open-world reasoning. Then we organize the benchmark around a ReAct-style capability formulation and compose atomic tasks into implicit multi-hop tasks. To further construct reliable instances, MineExplorer uses a multi-agent synthesis workflow that jointly designs task graphs, sandbox scenes, and rule-based milestone evaluators. Human evaluation shows that the multi-agent synthesis workflow produces significantly more reliable instances than a single-agent baseline. Experiments with advanced MLLM agents show that open-world exploration remains challenging, as strong models can handle many single-hop tasks but degrade sharply when hidden prerequisites must be coordinated over longer trajectories. Further analysis finds that task difficulty tracks agent completion, and larger models or thinking modes do not consistently translate into better performance. Code and dataset are available at https://github.com/Jometeorie/MineExplorer.
Abstract:Temperature-zero BF16 LLM inference is often treated as reproducible, yet the same request can emit different tokens when decoded alone or inside a larger batch. Existing fixes use batch-invariant operators or LLM-42's per-token verification, incurring cost even when most steps are stable. We ask whether verification can be applied exclusively to flipped tokens. Across five models, batch-induced token flips are sparse on the flip-rate benchmarks: on MATH500, Llama-3.1-8B flips on $0.48\%$ of synchronous decode steps, and all tested models stay within the 0.3-1.3% range on MATH500, GSM8K, and HumanEval. K/V perturbations remain flat before flips, while low top-1/top-2 logit margins expose much of the flip risk. MarginGate turns these observations into a verifier policy: it keeps BF16 decoding on high-margin steps, verifies only low-margin steps, and repairs confirmed mismatches by replacing the current K/V column. We evaluate on four datasets, calibrating on MATH500 and transferring to GSM8K, SharedGPT, and HumanEval. MarginGate restores 100% sequence-level deterministic decoding on Llama-3.1-8B and Qwen2.5-14B with 18.56%/15.05% verifier trigger rates, reducing LLM-42's latency increment by 2.23x/1.99x relative to always-on verification. On DSR1-Distill-Qwen-7B, the same policy reaches determinism in a harder regime at 49.50% triggers.
Abstract:Single-cell RNA sequencing (scRNA-seq) profiles large numbers of cells but loses spatial context, whereas spatial transcriptomics (ST) preserves partial spatial structure at lower resolution. Most existing integration methods either deconvolve spot mixtures or map cells onto a measured spot lattice, which ties reconstructions to a fixed grid and slide-specific coordinate systems, a limitation that is especially problematic in unpaired settings. We propose GEARS, a geometry-first framework that reconstructs an intrinsic single-cell spatial geometry guided by ST, without relying on cell-type labels, histological images, or cell-to-spot assignment. GEARS first learns a domain-invariant expression encoder that aligns ST spots and dissociated cells, and then trains a permutation-equivariant generator with a diffusion-based refiner with EDM-style preconditioning to generate local spatial geometries under pose-invariant supervision derived from ST coordinates. At inference, GEARS reconstructs geometry on many overlapping subsets of scRNA-seq cells, aggregates predicted pairwise distances across subsets, and solves a global distance-geometry problem to obtain canonical two-dimensional coordinates and a dense distance matrix. Extensive quantitative and qualitative experiments, including cross-section generalization, show that GEARS consistently improves global distance preservation, local neighborhood fidelity, and spatial distribution alignment compared to strong spatial mapping and deconvolution baselines.
Abstract:LLMs can now produce full HTML pages, but many of those pages are only superficially correct: they render once, then fail under scroll, hover, click, resize, or gameplay. Evaluation from screenshots can miss these failures, and filtering discards many pages that are still repairable. We introduce HTMLCure, a browser experience framework that evaluates HTML after the system has interacted with it. The evaluator executes the page across viewports and interaction states, records deterministic browser evidence, and gives the VLM curated keyframes from the executed trajectory rather than isolated screenshots. The same state signal drives a closed loop repair engine: HTMLCure diagnoses the current page, chooses a state specific repair family, runs each candidate again, and exports quality cleared pages for SFT. On a 97K prompt corpus, this expands the directly usable seed into a candidate pool of 63703 quality cleared pages, from which we construct the final refined SFT set of 40K pages. Under the same backbone and training recipe, HTMLCure-27B-Refined reaches 50.6 on HTMLBench-400 with 45.2% deterministic test case pass, placing it in the same performance band as strong reference rows such as Kimi-K2.6 and GPT-5.4. On the released MiniAppBench validation split, it reaches 81.2 average, improving raw 27B SFT by 15.3 points and approaching the level of strong reference systems.
Abstract:Charts play a critical role in conveying numerical data insights through structured visual representations. However, semantic visual understanding and numerical reasoning requirements hinder the accurate description of charts, interpreting a challenging task in chart summarization. Despite recent advancements in visual language models (VLMs), approaches lack robust mechanisms for verifying statistical fact correctness and are computationally heavy. To address this gap, this paper explores a strategy of using zero-shot learning to motivate the lightweight VLMs to perform computational reasoning, via Python programs as intermediaries to derive valid summary statistics for chart understanding. Specifically, we introduce a novel chart-to-dictionary auxiliary task, offering a more flexible representation compared to traditional chart-to-table methods, making it particularly well-suited for integration with the Program-of-Thought (PoT) strategy. Experimental results demonstrate our strategy performs on par with existing chart summarization methods across semantic and factual metrics. Code is available on https://anonymous.4open.science/r/ZeroShot-PoT-C2T-5A6B.
Abstract:Complex Query Answering (CQA) is a fundamental knowledge representation and reasoning task over incomplete knowledge graphs (KGs). Answering existential first-order queries with $k$ free variables (i.e., $\text{EFO}_k$ queries) is a crucial yet challenging problem, as it requires ranking answer tuples in $\mathcal{E}^k$, where $\mathcal{E}$ denotes the entity set of a KG. This quickly becomes intractable as $k$ grows. Consequently, existing benchmarks and methods rely on marginal rankings over individual variables; however, marginal rankings are a poor proxy for the true joint ranking of tuples. Building on neural symbolic search for $\text{EFO}_1$ queries, we propose Neural Scalable Symbolic Search (NS3), a budgeted framework that approximates joint ranking without enumerating $\mathcal{E}^k$. NS3 (i) answers marginalized sub-queries to obtain necessary candidate sets, (ii) merges multiple free variables into hypernodes whose domains are pruned and controlled by a dynamic budget $B$, and (iii) progressively reduces an $\text{EFO}_k$ query to an $\text{EFO}_{k-1}$ query over a budgeted reduced domain. Across three standard KG datasets, NS3 substantially improves joint ranking performance while retaining strong marginal accuracy. We further release a joint-ranking benchmark that extends existing $\text{EFO}_1$ datasets to $k=3$, enabling systematic evaluation of multi-variable queries. Our code is provided in https://github.com/HKUST-KnowComp/NS3_KDD2026.
Abstract:Post-training via Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) is crucial for enhancing reasoning in Multimodal Large Language Models (MLLMs), yet existing paradigms often reach a performance bottleneck due to the limitations of static data. While current methods leverage self-reflection or self-evolution to push these boundaries, they still suffer from cognitive drift and hallucinated reasoning paths caused by low-quality synthetic data. To address these challenges, we propose Anchor Evolution (AnE), a new paradigm that integrates truth-anchored data curation and model evolution, achieving faithful and steady performance gains at the reasoning frontier. Specifically, we propose Truth Anchor Expansion, which pinpoints the model failing frontier via trajectory rollouts and leverages ground-truth databases to retrieve high-fidelity anchors for faithful data curation. Subsequently, we introduce the Scaffold-Stripping Mechanism to internalize reasoning capabilities. This mechanism first anchors reasoning paths via scaffold-augmented supervision to mitigate the learning complexity and distribution drift of direct SFT on raw data, then leverages RL to strip the scaffold template, thereby effectively transitioning the reasoning paths into intrinsic model capabilities. Experimental results on multimodal reasoning benchmarks show that our method substantially advances the model performance frontier, improving the base model by 10.3\% across eight multimodal benchmarks and achieving state-of-the-art results. The code will be made publicly available.
Abstract:Accurate evaluation of weather forecasting models is critical for their reliable deployment in real-world applications. However, existing benchmarks predominantly rely on reanalysis products such as ERA5, which are generated through delayed data assimilation and do not reflect the constraints of real-time operational forecasting, thereby resulting in a systematic mismatch between benchmark performance and real-world forecasting. In this work, we introduce RealBench, a next-generation benchmark for AI weather forecasting that emphasizes realistic evaluation under operational conditions. RealBench features a strictly out-of-distribution test set spanning 2025 to eliminate data leakage and capture recent atmospheric regimes. It integrates multiple data sources, including low-latency operational analysis and a large-scale global in-situ observation dataset comprising over 10,000 stations, enabling direct evaluation against real atmospheric measurements. Beyond standard global metrics, RealBench provides a comprehensive evaluation framework for high-impact extreme events, including heatwaves, cold surges, and tropical cyclones, using event-specific metrics that better reflect real-world forecasting priorities. The evaluation results reveal substantial discrepancies between reanalysis-based metrics and real-world performance, particularly concerning extreme events. By highlighting the limitations of existing benchmarks, this work establishes a more faithful and operationally relevant evaluation paradigm, providing a rigorous foundation for advancing next-generation AI weather forecasting systems. The benchmark implementation is available at: https://github.com/lixruize-del/NWP-Benchmark.
Abstract:Video world models have achieved strong visual realism, but this does not ensure that their dynamics are truly governed by actions. In this work, we argue that action faithfulness should be understood through the compositional structure of actions, which in many embodied settings follows a group structure (e.g., SE(2) for navigation). Based on this insight, we formalize action-conditioned world modeling as realizing a group action on the state space, providing a principled criterion for evaluating dynamics beyond visual quality. To operationalize this framework, we propose a unified approach that enforces identity, inverse, and composition consistency via latent-space regularization with synthesized supervision, avoiding additional data collection. We further introduce two metrics: Group-Action Consistency (GAC) and Group-Action Robustness (GAR), to evaluate structural correctness and rollout stability. Extensive experimental results show that our method consistently improves both GAC and GAR in state-of-the-art video world models without degrading perceptual quality.
Abstract:Tool-augmented LLM agents increasingly access the same tool type through multiple functionally equivalent providers, such as web-search APIs, retrievers, or LLM backends exposed behind a shared interface. This creates a provider-routing problem under runtime load: the router must choose among providers that differ in latency, reliability, and answer quality, often without gold labels at deployment time. We introduce LQM-ContextRoute, a contextual bandit router for same-function tool providers. Its key design is latency-quality matching: instead of letting low latency offset poor answers in an additive reward, the router ranks providers by expected answer quality per service cycle. It combines this capacity-aware score with query-specific quality estimation and LLM-as-judge feedback, allowing it to adapt online to both load changes and provider-quality differences. On the main web-search load benchmark, LQM-ContextRoute improves F1 by +2.18 pp over SW-UCB while staying on the latency-quality frontier. In a high-heterogeneity StrategyQA setting, LQM-ContextRoute avoids additive-reward collapse and improves accuracy by up to +18 pp over SW-UCB; on heterogeneous retriever pools, it improves NDCG by +2.91--+3.22 pp over SW-UCB. These results show that same-function tool routing benefits from treating latency as service capacity, especially when runtime pressure and provider-quality heterogeneity coexist.